Natalia Organek - text classification

Get acquainted with the data of the Polish Cyberbullying detection dataset.

Pay special attention to the distribution of the positive and negative examples in the first task, as well as the distribution of the classes in the second task.

Preprocessing

Task 1 - there are far more instances without harmful expressions than with them.

Task 2 - neutral expressions have an overwhelming majority; there are far fewer expressions containing hate, and the fewest contain cyberbullying.

The datasets are VERY imbalanced, so I will balance their training sets.
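One simple way to balance a training set is random undersampling of the majority class; this is only a sketch of the idea, not necessarily the exact method used in this notebook:

```python
import random

def undersample(texts, labels, seed=42):
    """Randomly undersample every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        for t in rng.sample(items, n_min):  # keep n_min examples per class
            balanced.append((t, y))
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)
```

Undersampling throws data away; oversampling the minority class is an alternative when the dataset is small.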

Train the following classifiers on the training sets (for the task 1 and the task 2):

Bayesian classifier with TF * IDF weighting.

Taken from this link
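A minimal sketch of such a classifier with scikit-learn (toy data and labels are illustrative only; the actual notebook follows the linked tutorial):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: 1 = harmful, 0 = harmless (made-up examples)
texts = ["ty pajacu", "milego dnia", "jestes glupi", "dziekuje bardzo"]
labels = [1, 0, 1, 0]

# TF-IDF features fed into a Multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["pajacu"]))
```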

Fasttext text classifier
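fastText's supervised mode expects one example per line, with each label prefixed by `__label__`; a sketch of preparing data in that format (examples are made up):

```python
def to_fasttext_line(label, text):
    """Format one training example for fastText's supervised mode."""
    # fastText treats whitespace-separated tokens starting with __label__ as labels
    return f"__label__{label} {text.strip()}"

examples = [(0, "Spoko, wszystko ok"), (1, "Ty pajacu")]
lines = [to_fasttext_line(y, t) for y, t in examples]

# A real run would write these lines to a file and train with:
#   model = fasttext.train_supervised(input="train.txt")  # requires the fasttext package
print(lines[0])
```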

Transformer classifier (take into account that a number of experiments should be performed for this model).

Compare the results of classification on the test set.

Select the appropriate measures (from accuracy, F1, macro/micro F1, MCC) to compare the results.
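To see why these measures disagree on imbalanced data, compare them (using scikit-learn's metrics) for a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Toy imbalanced labels: 90% negative, and a model that always answers 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)               # high despite a useless model
macro = f1_score(y_true, y_pred, average="macro")  # punished by the all-zero class 1
micro = f1_score(y_true, y_pred, average="micro")  # equals accuracy here
mcc = matthews_corrcoef(y_true, y_pred)            # 0.0 for a constant predictor

print(acc, macro, micro, mcc)
```

Macro F1 and MCC expose the constant predictor, which is why they are more informative than accuracy for these datasets.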

Task 1

Task 2

Select 1 TP, 1 TN, 1 FP and 1 FN from your predictions (for the best classifier) and compare the decisions of each classifier on these examples using LIME.
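LIME itself (the `lime` package's `LimeTextExplainer`) fits a local linear model on word-level perturbations of the input. The core intuition can be sketched with a much simpler leave-one-word-out ablation; the keyword-based scoring function below is a toy stand-in for a real classifier:

```python
def word_importance(text, score_fn):
    """Leave-one-word-out ablation: how much each word shifts the score."""
    words = text.split()
    base = score_fn(text)
    importances = {}
    for i, w in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        importances[w] = base - score_fn(perturbed)  # >0 pushes toward CB
    return importances

# Toy classifier: "probability of cyberbullying" grows with offensive keywords
OFFENSIVE = {"pajacu", "szmaty"}
def toy_score(text):
    words = text.split()
    return sum(w in OFFENSIVE for w in words) / max(len(words), 1)

imp = word_importance("Ty pajacu poczytaj ustawe", toy_score)
```

Unlike this ablation, LIME samples many random perturbations and weights them by similarity to the original text, so its attributions are smoother and come with a local surrogate model.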

The best classifier was fastText, so it will be used.

LIME

TASK 1

TFIDF
Transformer
Fasttext

TASK 2

TFIDF
TRANSFORMER
FASTTEXT

Answer the following questions:

Which of the classifiers works best for task 1 and task 2?

For both tasks the best classifier is fastText:

TASK 1

| Model | Accuracy | F1 | Macro F1 | Micro F1 | MCC |
|---|---|---|---|---|---|
| TFIDF | 0.783 | 0.7878 | 0.5534 | 0.7830 | 0.1075 |
| FASTTEXT | 0.877 | 0.8494 | 0.6227 | 0.8770 | 0.3111 |
| TRANSFORMER | 0.866 | 0.8038 | 0.4641 | 0.8660 | 0.0 |

TASK 2

| Model | Accuracy | F1 | Macro F1 | Micro F1 | MCC |
|---|---|---|---|---|---|
| TFIDF | 0.776 | 0.7774 | 0.3258 | 0.7760 | 0.0605 |
| FASTTEXT | 0.855 | 0.8124 | 0.3282 | 0.8550 | 0.1419 |
| TRANSFORMER | 0.134 | 0.0317 | 0.1182 | 0.1340 | 0.0 |

Both task datasets are very imbalanced, so accuracy is not a good metric: a model that answers 0 for every example would still get high accuracy. F1 and MCC are better suited here, and by those measures fastText was the best for both tasks.

For the first task the next best was the Transformer and the worst was TFIDF, although the plain F1 scores are all fairly high (>0.78).

For the second task the second-best model was TFIDF, and the Transformer performed really, really badly (maybe it should have been trained for more than 3 epochs, but one epoch took 1.5 hours).

Did you achieve results comparable with the results of PolEval Task?

Fasttext - I got higher results, which is... disturbing. (F1 for task 1 | F1-min for task 2: 0.85 | 0.88 vs 0.4135 | 0.4722)

Did you achieve results comparable with the Klej leaderboard?

This one concerns transformers, so:

For the first task - let's say yes, because on the test dataset the accuracy was even high, but let's face it: on the training dataset it was very low (sparse_categorical_accuracy: 0.5021).

For the second task - it's not even worth commenting on; the score is really low (which, ironically, could be useful in 2-class classification :D ).

Describe strengths and weaknesses of each of the compared algorithms.

First - analysis

Disclaimer 1 - more results are visible when opening this notebook as HTML or in Colab (https://colab.research.google.com/drive/10DcPgyE7Q3wRO3lE_ujK3dUuqK2RUIfM?usp=sharing) - some visualizations are missing when the ipynb is opened in Jupyter.

Disclaimer 2 - a positive score means there is CB in the sentence; a positive word is just positive (or neutral) - it doesn't contribute CB to the sentence :D sorry for this mess.

TASK 1

**TP**

'@anonymized_account Dokładnie wie co mówi. A Ty pajacu poczytaj ustawę domsie dowiesz kto decyduje o wysokości zarobków w samorządach.'

Only fastText recognises that the word 'pajacu' is strongly negative; the Transformer sees it as positive and TFIDF as only slightly negative. FastText also understands that 'Ty' is often part of bullying; the remaining negative words were 'ustawę' and 'dowiesz'. TFIDF was almost sure that this sentence has no CB because of the words 'poczytaj', 'dowiesz', 'decyduje', 'wysokości'. The Transformer was not really sure (0.52 vs 0.48) but chose the NO CB class, and I don't understand its choices - it treats 'pajacu' and 'ustawę' as positive words, and 'Dokładnie', 'wie', 'mówi' as positive ones too.

**TN** '@anonymized_account Spoko, jak im Duda z Morawieckim zamówią po pięć piw to wszystko będzie ok.'

Fasttext - ok words: spoko, Morawieckim (`¯\_(ツ)_/¯`), będzie, ok; not ok: Duda - seems almost logical to me

Transformer - ok words: Morawieckim, zamówią, będzie; not ok: anonymyzed account, spoko

TFIDF - ok: Morawieckim, pięć, ok; not ok: z, będzie, Duda

Every model here recognised this sentence as NO CB, so that was correct, and they had similar premises. Also, it seems there were more negative comments about Mr. Duda than Mr. Morawiecki on Twitter in 2019 ;) (or - to be more precise - in this dataset)

**FP** '@anonymized_account @anonymized_account Kto mieczem wojuje, ten od pochwy ginie'

TFIDF - I got some peculiar results: the model was sure the sentence contains CB (100%), but the word analysis shows that every single word in this sentence has a positive impact (i.e. indicates no bullying).

TRANSFORMER - not ok words: mieczem, ginie, wojuje - violent words; ok: kto, od, pochwy

FASTTEXT - the model was not really sure (0.49 vs 0.51) whether the sentence contains cyberbullying. 'Wojuje' has the most negative impact, and only 2 words are positive ('ten', 'od').

**FN** '@anonymized_account Tej szmaty się nie komentuje'

FASTTEXT - the model is 100% sure that every word in the sentence indicates it has no CB

TRANSFORMER - only the word 'komentuje' is negative; 'szmaty' is the most positive word for the model

TFIDF - the word 'komentuje' is the only influential word, and it indicates that the sentence has no CB

TASK 2:

FASTTEXT:

TFIDF:

TRANSFORMER - these results should not be taken into consideration - the model's score is REALLY low and LIME shows almost nothing.

Strengths and weaknesses

Do you think comparison of raw performance values on a single task is enough to assess the value of a given algorithm/model?

Not really; performance also depends on time - training the transformer was a nightmare (I didn't have a GPU and used Google Colab, which has some usage constraints), and after about 5 hours of training it performed better than TFIDF, but I'm not sure it was worth it. Not for uni labs. No.

Still, it is important whether a model can explain itself (with LIME, of course). For the transformer, words seemed to have a really low impact on the score (around 10e-8) - I don't know what actually drove the decision; maybe it really was those words and just the scoring was strange, or I made a mistake.

Another thing - some models can be better at other tasks, as shown on the Klej leaderboard.

And a lot depends on the dataset (number of instances, number of classes, sentence lengths, repeated tokens like 'anonymized_account').

Did LIME show that the models use valuable features/words when performing their decision?

It depended on the model and the example - see the per-example analysis above.